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Abstract 

Previous  multi-document  relationship  ex¬ 
traction  and  fusion  research  has  focused 
on  single  relationships.  Shifting  the  fo¬ 
cus  to  multiple  relationships  allows  for  the 
use  of  mutual  constraints  to  aid  extrac¬ 
tion.  This  paper  presents  a  fusion  method 
which  uses  a  probabilistic  database  model 
to  pick  relationships  which  violate  few 
constraints.  This  model  allows  improved 
performance  on  constructing  corporate 
succession  timelines  from  multiple  doc¬ 
uments  with  respect  to  a  multi-document 
fusion  baseline. 

1  Introduction 

Single  document  information  extraction  of  named 
entities  and  relationships  has  received  much  atten¬ 
tion  since  the  MUC  evaluations1  in  the  mid-90s  (Ap- 
pelt  et  al.,  1993;  Grishman  and  Sundheim,  1996). 
Recently,  there  has  been  increased  interest  in  the 
extraction  of  named  entities  and  relationships  from 
multiple  documents,  since  the  redundancy  of  infor¬ 
mation  across  documents  has  been  shown  to  be  a 
powerful  resource  for  obtaining  high  quality  infor¬ 
mation  even  when  the  extractors  have  access  to  little 
or  no  training  data  (Etzioni  et  al.,  2004;  Hasegawa 
et  al.,  2004).  Much  of  the  recent  work  in  multi¬ 
document  relationship  extraction  has  focused  on 
the  extraction  of  isolated  relationships  (Agichtein, 
2005;  Pasca  et  al.,  2006),  but  often  the  goal,  as  in 

1  http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ 


single  document  tasks  like  MUC,  is  to  extract  a  tem¬ 
plate  or  a  relational  database  composed  of  related 
facts. 

With  databases  containing  multiple  relationships, 
the  semantics  of  the  database  impose  constraints 
on  possible  database  configurations.  This  paper 
presents  a  statistical  method  which  picks  relation¬ 
ships  which  violate  few  constraints  as  measured  by 
a  probabilistic  database  model.  The  constraints  are 
hai'd  constraints,  and  robust  estimates  are  achieved 
by  accounting  for  the  underlying  extraction/fusion 
uncertainty. 

This  method  is  applied  to  the  problem  of  con¬ 
structing  management  succession  timelines  which 
have  a  rich  set  of  semantic  constraints.  Using  con¬ 
straints  on  probabilistic  databases  yields  F-Measure 
improvements  of  5  to  18  points  on  a  per-relationship 
basis  over  a  state-of-the-art  multi-document  extrac¬ 
tion/fusion  baseline.  The  constraints  proposed  in 
this  paper  are  used  in  a  context  of  minimally  super¬ 
vised  information  extractors  and  present  an  alterna¬ 
tive  to  costly  manual  annotation. 

2  Semantic  Constraints  on  Databases 

This  paper  considers  management  succession 
databases  where  each  record  has  three  fields:  a 
CEO’s  name  and  the  start  and  end  years  for  that 
person’s  tenure  as  CEO  (Table  1  Column  1).  Each 
record  is  represented  by  three  binary  logical  pred¬ 
icates:  ceo(c,x),  start{x ,y1),  end(x,y2),  where  c  is 
a  company,  x  is  a  CEO’s  name,  and  y1  and  y2  are 
years.2 

2  All  of  the  relationships  in  this  paper  are  defined  to  be  bi¬ 
nary  relationships.  When  extracting  relationships  of  higher  ar- 
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Explicit 

Relationships 

Implicit 

Relationships 

Logical  Constraints 
(a  partial  list) 

ceo{c,  x) 
start(x,  y) 
end(x,  y) 

precedes(xl  ,x2) 
inofficeix,  y) 
p  re  date  six1 ,  x2) 

ceo{c,xL),  precedes{xl ,x2)  ceo(c,x2),  end(xL,y),  startix2,  y) 

start{x,yl),  inoffice(x,y2),  end{x,  y3)  =$■  yl  <  y2  <  y3 
inofficeix1,  yl),  inoffice{x2,  y2) ,  y1  <  y2  =>•  predate s{xl ,x2) 
precedesix1  ,x2) ,  inofficeix1  ,yl),  inofficeix2  ,y2)  =)>  y1  <  y2 

Table  1:  Database  semantics  can  provide  1)  a  method  to  augment  the  explicit  relationships  in  the  database 
with  implicit  relationships  and  2)  logical  constraints  on  these  explicit  and  implicit  relationships.  In  the  above 
table,  c  is  a  company,  xl  is  a  person,  and  yl  is  a  time. 


In  this  setting,  database  semantics  allow  for  the 
derivation  of  other  implicit  relationships  from  the 
database:  the  immediate  predecessor  of  a  given  CEO 
{precedes^1  ,x2)),  all  predecessors  of  a  given  CEO 
ipredatesix1  ,x2))  and  all  years  the  CEO  was  in  of¬ 
fice  ( inofficeix, y 3)),  where  x~  is  a  CEO’s  name  and 
t3  a  year  (Table  1  Column  2). 

These  implicit  relationships  and  the  original  ex¬ 
plicit  relationships  arc  governed  by  a  series  of  se¬ 
mantic  relations  which  impose  constraints  on  the 
permissible  database  configurations  (Table  1  Col¬ 
umn  3).  For  example,  it  will  always  be  true  that  a 
CEO’s  start  date  precedes  their  end  date: 

Vx  :  start(x,  y1),  end(x,  y2)  =>  y1  <  y2  . 

Multi-document  extraction  of  single  relationships 
exploits  redundancy  and  variety  of  expression  to  ex¬ 
tract  accurate  information  from  across  many  docu¬ 
ments.  However,  these  arc  not  the  only  benefits  of 
extraction  from  a  large  document  collection.  As  well 
as  being  rich  in  redundant  information,  large  docu¬ 
ment  collections  also  contain  a  wealth  of  secondary 
relationships  which  are  related  to  the  relationship  of 
interest  via  constraints  as  described  above.  These 
secondary  relationships  can  yield  benefits  to  aug¬ 
ment  those  achieved  by  redundancy. 

3  Multi-Document  Database  Fusion 

There  arc  typically  two  steps  in  the  extraction  of  sin¬ 
gle  relationships  from  multiple  documents.  In  the 
first  step,  a  relationship  extractor  goes  through  the 
corpus,  finds  all  possible  relationships  r  in  all  sen¬ 
tences  s  and  gives  them  a  score  p(r\s).  Next,  the 

ity,  typically  binary  relationships  are  combined  (McDonald  et 
al„  2005). 


relationships  are  fused  across  sentences  to  generate 
one  score  <pr  for  each  relationship. 

This  paper  proposes  a  third  step  which  combines 
the  fusion  scores  across  relationships.  This  section 
first  presents  a  probabilistic  database  model  gener¬ 
ated  from  fusion  scores  and  then  shows  how  to  use 
this  model  for  multi-document  fusion. 


3.1  A  Probabilistic  Database  Model 

A  relationship  r  is  defined  to  be  a  3-tuple  rt’a,b  = 
r(t ,  a,  b),  where  t  is  the  type  of  the  relationship  (e.g. 
start),  and  a  and  b  arc  the  arguments  of  the  binary 
relationship.3 

To  construct  a  probabilistic  database  for  a  given 
corpus,  the  weights  generated  in  relationship  fusion 
arc  normalized  to  provide  the  conditional  probability 
of  a  relationship  given  its  type: 


p(rr’b\ti)  = 


(j)  t,a,b 


t,a,b 


where  6r  is  the  fusion  score  generated  by  the  extrac¬ 
tion/fusion  system.4  By  applying  a  prior  over  types 
p(t),  a  distribution  p(r\,t{)  can  be  derived.  Given 
strong  independence  assumptions,  the  probability  of 
an  ordered  database  configuration  R  =  n..n  of  types 

h..n  is: 


n 

p(ri..niti..n)  =  W_p[ri,ti) .  (1) 

i=  1 


3Forreadibility  in  future  examples,  “a”  and  “b”  are  replaced 
by  the  types  of  their  arguments.  For  example,  for  start  the  year 
in  which  the  CEO  starts  is  referred  to  as  ryear . 

4The  following  fusion  method  does  not  depend  on  a  par¬ 
ticular  extraction/fusion  architecture  or  training  methodology, 
merely  this  conditional  probability  distribution. 


As  proposed,  the  model  in  Equation  1  is  faulty 
since  the  relationships  in  a  database  arc  not  inde¬ 
pendent.  Given  a  set  of  database  constraints,  certain 
database  configurations  arc  illegal  and  should  be  as¬ 
signed  zero  probability.  To  address  this,  the  model 
in  Equation  1  is  augmented  with  constraints  that  ex¬ 
plicitly  set  the  probability  of  a  database  configura¬ 
tion  to  zero  when  they  arc  violated. 

A  database  constraint  is  a  logical  formula 
v(ri..n (?/))>  where  n (77)  is  the  arity  of  the  constraint 
r).  For  the  constraints  presented  in  this  paper,  all 
constraints  77  arc  modeled  with  two  terms  77“  and  77 @ 
where: 

V(ri.. *(„))  =  (V(?T.. 7r(r,))  =►  • 

For  a  set  of  relationships,  a  constraint  holds  if'  r/(-) 
is  true,  and  the  constraint  applies  if  rja(-)  is  true.  A 
constraint  ry(-)  can  only  be  violated  (false)  when  the 
constraint  applies,  since:  (false  =>•  X)  =  true. 

In  application  to  a  database,  each  constraint  77  is 
quantified  over  the  database  to  become  a  quantified 
constraint  r/n  n.  For  example,  the  constraints  that  a 
person’s  start  date  must  come  before  their  end  date  is 
universally  quantified  over  all  pairs  of  relationships 
in  a  configuration  R  =  r\__n: 

Vn..n  =  Vri,r2  G  R  :  77(71,  r2)  = 

(r\  =  start ,  r\  =  end ,  r“°  =  r^°) 
=>  (rf “r  <  rfar ) . 

This  constraint  applies  to  start  and  end  relationships 
whose  CEO  argument  matches  and  is  violated  when 
the  years  arc  not  in  order.  If  the  quantified  constraint 
r]ri  n  is  true  for  a  given  database  configuration  n..n 
then  it  holds. 

To  ensure  that  only  legal  database  configurations 
are  assigned  positive  probabilities,  Equation  1  is 
augmented  with  a  factor 

1  if  rirln  holds 
0  otherwise  . 

To  include  a  constraint  77,  the  database  model  in 
Equation  1  is  extended  to  be: 


where  Z  is  the  partition  function  and  corresponds  to 
the  total  probability  of  all  database  configurations.  A 
set  of  constraints  771 =  z/1  ..rp  can  be  integrated 
similarly: 

Pvr.Q(ri..n,ti..n)  =  ^ 

(2) 

With  these  added  constraints,  the  probabilistic 
database  model  assigns  non-zero  probability  only  to 
databases  which  don’t  violate  any  constraints. 

3.2  Constraints  on  Probabilistic  Databases  for 
Relationship  Rescoring 

Though  the  constrained  probabilistic  database 
model  in  Equation  2  is  theoretically  appealing,  it 
would  be  infeasible  to  calculate  its  partition  func¬ 
tion  which  requires  enumeration  of  all  legal  2n 
databases.  This  section  proposes  two  methods  for 
re-scoring  relationships  with  regards  to  how  likely 
they  arc  to  be  present  in  a  legal  database  configu¬ 
ration  using  the  model  proposed  above.  The  first 
method  is  a  confidence  estimate  based  on  how  likely 
it  is  that  77  holds  for  a  given  relationship  rq : 

Aj?(7’i,  G)  =  Rp(r2..„,t2..n) 

(n  rl?  Kn, to) 

(nSVn.y) 

_  ^2r2_Mv)P’n(rl..n,tl..n) 

~  ^r2..„MP(r l..n,tl..n)  ’ 

where  the  expectation  that  the  constraint  holds 
is  equivalent  to  the  likelihood  ratio  between  the 
database  probability  models  with  and  without  con¬ 
straints.  In  effect,  this  model  measures  the  expec¬ 
tation  that  the  constraint  holds  for  a  finite  database 
“look-ahead”  of  size  77(77)  —  1- 

With  this  method,  for  a  constraint  to  reduce  the 
confidence  in  a  particular  relationship  by  half,  half 
of  all  configurations  would  have  to  violate  the  con¬ 
straint.5  Since  inconsistencies  are  relatively  rare,  for 
a  given  relationship  Ar?(r,  t)  ~  1  (i.e.  almost  all 
small  databases  arc  legal). 


Pr/ir l..ri) 


’Assuming  equal  probability  for  all  relationships. 


To  remedy  this,  another  factor  0'y  is  defined  simi¬ 
larly  to  (f>'K  except  that  it  takes  a  value  of  1  only  if  the 
constraint  applies  to  that  database  configuration.  An 
applicability  probability  model  is  then  defined  as: 


Pair l..nj  U..n) 


The  second  confidence  estimate  is  based  on  how 
likely  it  is  that  the  constraint  is  holds  in  cases  where 
it  applies  (i.e.  is  not  violated): 


A-T],a (t”l  j  U) 


=  E, 


E 


r2..ir(ri) 


n  iStP(ri,uj) 


^*1. .77(77) 


E 


^2.  .77(77) 


(n i=2P(ruU))  <t>\ 


^1. .77(77) 


When  the  constraint  doesn’t  apply  it  cannot  be  vio¬ 
lated,  so  this  confidence  estimate  ignores  those  con¬ 
figurations  that  can’t  be  affected  by  the  constraint. 

Recall  that  A v(r,t)  is  the  likelihood  ratio  be¬ 
tween  the  probability  of  configurations  in  which  r 
holds  for  constraint  77  and  all  configurations.  In  con¬ 
trast,  A ,hCt(r,t)  is  the  likelihood  ratio  between  the 
database  configurations  where  r  applies  and  holds 
for  rj  and  the  database  configurations  where  77  ap¬ 
plies.  In  the  later  ratio,  for  confidence  in  a  particular 
relationship  to  be  cut  in  half,  only  half  of  the  con¬ 
figurations  which  might  actually  contain  an  incon¬ 
sistency  would  be  required  to  produce  a  violation.6 
As  a  result,  A,ha(r,  t  )  gives  a  much  higher  penalty  to 
relationships  which  create  inconsistencies  than  does 
Av{r,t). 

In  order  to  apply  multiple  constraints,  indepen¬ 
dent  database  look-aheads  are  generated  for  each 
constraint  q: 


Avi..Q  ai..Q(ri,ti)  =  ]^[  A^q tCeq (r\ ,  t\) . 


For  a  particular  relationship  type,  these  confidence 
scores  are  calculated  and  then  used  to  rank  the  rela- 

6For  example,  for  a  start  relationship  and  the  constraint  that 
a  CEO  must  start  before  they  end,  this  method  would  only  ex¬ 
amine  configurations  of  one  start  and  one  end  relationship  for 
the  same  CEO.  The  confidence  in  a  particular  start  date  would 
be  halved  if  half  of  the  proposed  end  dates  for  a  given  CEO 
occurred  before  it. 


tionships  via: 

crii..Qai..Q(ri,tl)  =  p{r\,t\)  n 

<7 

(3) 

Databases  with  different  precision/recall  trade  offs 
can  be  selected  by  descending  the  ranked  list.7 

4  Experiments 

In  order  to  test  the  fusion  method  proposed  above, 
human  annotators  manually  constructed  truth  data 
of  complete  chief  executive  histories  for  1 8  Fortune- 
500  companies  using  online  resources.  Extraction 
from  these  documents  is  particularly  difficult  be¬ 
cause  these  data  have  vast  differences  in  genre  and 
style  and  are  considerably  noisy.  Furthermore,  the 
task  is  complicated  to  start  with.8 

A  corpus  was  created  for  each  company  by  is¬ 
suing  a  Google  query  for  “CEO-of-Company  OR 
Company-CEO”,  and  collecting  the  top  ranked  doc¬ 
uments,  generating  up  to  1000  documents  per  com¬ 
pany.  The  data  was  then  split  randomly  into  training, 
development  and  testing  sets  of  6,  4,  and  8  compa¬ 
nies. 


Training  :  Anheuser-Busch,  Hewlett-Packard, 

Lenner,  McGraw-Hill,  Pfizer,  Raytheon 
Dev.  :  Boeing,  Heinz,  Staples,  Textron 

Test :  General  Electric,  General  Motors, 

Gannett,  The  Home  Depot,  IBM, 
Kroger,  Sears,  UPS 


Ground  truth  was  created  from  the  entire  web,  but 
since  the  corpus  for  each  company  is  only  a  small 
web  snapshot,  the  experimental  results  are  not  simi¬ 
lar  to  extraction  tasks  like  MUC  and  ACE  in  that  the 
corpus  is  not  guaranteed  to  contain  the  information 
necessary  to  build  the  entire  database.  In  particular, 

7  One  thing  to  note  is  that  since  all  relationships  are  given 
confidence  estimates  separately,  this  process  may  result  ulti¬ 
mately  in  a  database  where  constraints  are  violated.  A  potential 
solution,  which  is  not  explored  here,  would  be  to  incrementally 
add  relationships  to  the  database  from  the  ranked  list  only  if 
their  addition  doesn’t  make  the  database  inconsistent. 

8For  example,  in  certain  companies,  the  title  of  the  chief 
executive  has  changed  over  the  years,  often  going  from  “Presi¬ 
dent”  to  “Chief  Executive  Officer”.  To  make  things  more  com¬ 
plicated,  after  the  change,  the  role  of  “President”  may  still  hang 
on  as  a  subordinate  to  the  CEO ! 


1)  Only  one  start  or  end  per  person. 

Vri,  r2  :  rj(n,  rf)  =  ( r\vpe  =  rtype  =  ( start  U  end),  rf633  =  r2eo)  =>  [r\ear  =  r2ear) 

2)  Only  a  CEO’s  start  or  end  dates  belong  in  the  database. 

V?’i3r2  :  rj(ri,  r2)  =  ( r[ype  =  start  U  end,  r2ype  =  ceo )  =>■  (rJeo  =  r2eo) 

3)  Start  dates  come  before  end  dates. 

Vri,  r2  :  »7(ri,  rf)  =  (rj33336  =  start,  rtype  =  end,  rieo  =  r);60)  =>■  (r\ear  <  r2ear) 

4)  Can’t  be  in  the  middle  of  someone  else’s  tenure. 

Vn,r2,  r3  :  ri{ri,r2,rs)  =  ( riype  =  start  U  inoffice,  r2ype  =  end  U  inoffice,  r3ype  =  sfarf  U  inoffice  U  end, 
rf  °  =  rr°  A  rf  °,  rf ar  <  ryear )  =>  (ryear  <  ryear  U  rf  ar  >  ryear) 

5)  CEO’s  are  only  in  office  after  their  start. 

Vri,r2  :  r)(ri,rf)  =  (rj33336  =  start, r2ype  =  inof  fice,rleo  =  r f33)  =t-  [ryear  <  ryear) 

6)  CEO’s  are  only  in  office  before  their  end. 

Vn,r2  :  t?(ri,r2)  =  ( r\vpe  =  inoffice ,rlype  =  end,r?°  =  rf°)  =>  (rfar  <  rfar) _ 

7)  Someone’s  end  is  the  same  as  their  successor’s  start. 

Vri,  r2,  r3  :  Jj(n,  r2,  r3)  = 

(rtppe  =  end,  rtppe  =  start,  r^pe  =  precedes,  rjeo  =  rf3irst,  r?°  =  rs3econd)  =»  (rfar  =  rfar) _ 

8)  All  of  the  someone’s  dates  (start,  inoffice,  end)  are  before  their  successors. 

Vri,  r2,  r3  :  r/(n,  r2,  r3)  =  (ry3336  =  sfarf  U  end  U  inoffice,  r2ype  =  start  U  inoffice  U  end,  rj33336  =  precedes, 
r1eo  =  r{irst,r?°  =  rs3econd )  =»  (rf ar  <  rf ar) 

9)  Only  CEO  succession  in  the  database. 

VriBri,  r2  :  t?(ri,r2,  r3)  =  (rf336  =  precedes,  rf336  =  rf336  =  ceo)  =>  (r{v rst  =  rf33,  rfecond  =  r3e°) 


Table  2:  For  a  CEO  succession  database  like  the  one  presented  in  Table  1,  the  above  constraints  must  hold 
if  the  database  is  consistent. 


many  CEOs  from  pre-Internet  years  were  either  in¬ 
frequently  mentioned  or  not  mentioned  at  all  in  the 
database.9  In  the  following  experiments,  recall  is  re¬ 
ported  for  facts  that  were  retrieved  by  the  extraction 
system. 

4.1  Relationship  Extraction  and  Fusion 

A  two-class  maximum-entropy  classifier  was  trained 
for  each  relationship  type.  Each  classifier  takes  a 
sentence  and  two  marked  entities  (e.g.  a  person  and 
a  year)10  and  gives  the  probability  that  a  relation¬ 
ship  between  the  two  entities  is  supported  by  the 
sentence.  For  each  relationship  type,  one  of  the  ele¬ 
ments  is  designated  as  the  “hook”  in  order  to  gener¬ 
ate  likely  negative  examples.11  In  training,  all  entity 
pairs  are  collected  from  the  corpus.  The  pairs  whose 
“hook”  element  doesn’t  appear  in  the  database  are 
thrown  out.  The  remaining  pairs  are  then  marked 

9Another  consequence  is  that  assessing  the  effectiveness  of 
the  relationships  extraction  on  a  per-extraction  basis  is  difficult. 
Because  there  are  no  training  sentences  where  it  is  known  that 
the  sentence  contains  the  relationship  of  interest,  grading  per- 
extraction  results  can  be  deceptive. 

10The  person  tagger  used  is  the  named-entity  tagger  from 
OpenNLP  tools  and  the  year  tagger  simply  finds  any  four  digit 
numbers  between  1950  and  2010. 

11  For  the  CEO  relationship,  the  company  was  taken  to  be 
the  hook.  For  the  other  relationships  the  hook  was  the  primary 
CEO. 


by  exact  match  to  the  database.  In  testing,  the  rela¬ 
tionship  extractor  yields  the  probability  p(r|s)  of  an 
entity  pair  relationship  r  in  a  particular  sentence  s. 

The  features  used  in  the  classifier  are:  unigrams 
between  the  given  information  and  the  target,  dis¬ 
tance  in  words  between  the  given  information  and 
the  target,  and  the  exact  string  between  the  given  in¬ 
formation  and  the  target  (if  less  that  3  words  long). 

After  extraction  from  individual  sentences,  the  re¬ 
lationships  are  fused  together  such  that  there  is  one 
score  for  each  unique  entity  pair.  In  the  case  of  per¬ 
son  names,  normalization  was  performed  to  merge 
coreferent  but  lexically  distinct  names  (e.g.  “Phil 
Condit”  and  “Philip  M.  Condit”). 

In  the  following  experiments,  the  baseline  fusion 
score  is: 

</>r  = 

s 

4.2  Experimental  Results 

Given  the  management  succession  database  pro¬ 
posed  in  Section  2,  Table  2  enumerates  a  set  of  quan¬ 
tified  constraints.  Information  extraction  and  fusion 
were  run  separately  for  each  company  to  create  a 
probabilistic  database.  In  this  section,  various  con¬ 
straint  sets  are  applied,  either  individually  or  jointly, 
and  evaluated  in  two  ways.  The  first  measures  per- 
relationship  precision/recall  using  the  model  pro- 


Figure  1:  Precision/Recall  curve  for  start(x,t)  rela¬ 
tionships.  The  joint  constraint  “2,3,5, 8”  is  the  best 
performing,  even  though  constraints  “3”  and  “8” 
(not  pictured)  alone  don’t  perform  well. 

posed  and  the  second  looks  at  the  precision/recall 
of  a  heterogeneous  database  with  many  relationship 
types.  Both  evaluations  examine  the  ranked  lists  of 
relationships,  where  the  relationships  are  ranked  by 
rescoring  via  constraints  on  probabilistic  databases 
(Equation  3)  and  compared  to  the  baseline  fusion 
score  (Equation  4).  The  evaluations  use  two  stan¬ 
dard  metrics,  interpolated  precision  at  recall  level  i 
(  Pr,  ),  and  MaxFl: 

P rt,  =  max  Pr,  , 

2 

MaxFl  =  max  — . 

*  P - h  TT 

Ri 

Figures  1,  2,  and  3  show  precision/recall  curves 
for  the  application  of  various  sets  of  constraints.  Ta¬ 
ble  3  lists  the  MaxFl  scores  for  each  of  the  con¬ 
straint  valiants.  For  start  and  end,  the  majority  of 
constraints  are  beneficial.  For  precedes ,  only  the 
constraint  that  improved  performance  constraints 
both  people  in  the  relationship  to  be  CEOs.  Across 
all  relationships,  performance  is  hurt  when  using  the 
constraint  that  there  could  only  be  one  relationship 
of  each  type  for  a  given  CEO.  The  reason  behind 
this  is  that  the  confidence  estimate  based  on  this 
constraint  favors  relationships  with  few  competitors, 
and  those  relationships  are  typically  for  people  who 
are  infrequent  in  the  corpus  (and  therefore  unlikely 
to  be  CEOs). 

The  best-performing  constraint  sets  yield  between 
5  and  18  points  of  improvement  on  Max  FI  (Ta¬ 
ble  3).  Surprisingly,  the  gains  from  joint  con- 


Figure  2:  Precision/Recall  curve  for  end(x,t)  rela¬ 
tionships  alone.  The  joint  constraint  “2,6”  is  the  best 
performing. 


Figure  3:  Precision/Recall  for  precedes(x,y)  rela¬ 
tionships  alone.  Though  the  constraint  “First  Before 
Second  (8)”  helps  performance  on  start(x,t)  relation¬ 
ships,  the  only  constraint  which  aids  here  is  “Only 
CEOs  Succession  (9)”. 


straints  are  sometimes  more  than  their  additive 
gains.  “2,3,5, 6, 8”  is  6  points  better  for  the  start  rela¬ 
tionship  than  “2,3,5, 6”,  but  the  gains  from  “8”  alone 
are  negligible. 

These  performance  gains  on  the  individual  rela¬ 
tionship  types  also  lead  to  gains  when  generating  an 
entire  database  (Figure  4).  The  highest  performing 
constraint  is  the  “CEOs  Only  (2)”  constraint,  which 
outperforms  the  joint  constraints  of  the  previous  sec¬ 
tion.  One  reason  the  joint  constraints  don’t  do  as 
well  here  is  that  each  constraint  makes  the  confi¬ 
dence  estimate  smaller  and  smaller.  This  doesn’t 
have  an  effect  when  judging  the  relationship  types 
individually,  but  when  combining  the  relationships 
results,  the  fused  relationships  types  {start,  end)  be- 


Constraint  Set 

Start 

Man 

End 

:  FI 

Pre. 

DB 

0  (baseline) 

31.2 

35.8 

34.5 

37.9 

Only  One  (1) 

10.5 

7.2 

- 

38.1 

CEOs  Only  (2)  or  (9) 

43.3 

39.4 

39.4 

42.9 

Start  Before  End  (3) 

40.8 

32.8 

- 

40.9 

No  Overlaps  (4) 

31.5 

35.9 

- 

36.8 

Inoffice  After  Start  (5) 

32.5 

- 

- 

38.2 

Inoffice  Before  End  (6) 

- 

36.5 

- 

37.4 

End  is  Start  (7) 

7.3 

8.0 

20.7 

39.2 

First  before  Second  (8) 

31.4 

35.6 

26.3 

38.1 

2,5,6 

43.3 

40.8 

- 

42.7 

2,3,5, 6 

43.9 

43.3 

- 

42.2 

2, 3, 5, 6, 8 

49.3 

43.9 

26.3 

40.9 

Table  3:  Max  FI  scores  for  three  relationships 
Start(x,t),  End(x,t)  and  Precedes(x,y))  in  isolation 
and  within  the  context  of  whole  database  DB.  The 
joint  constraints  perform  best  for  the  explicit  rela¬ 
tionships  in  isolation.  Using  constraints  on  implicit 
derived  fields  (Inoffice  and  Precedes)  provides  ad¬ 
ditional  benefit  above  constraints  strictly  on  explicit 
database  fields  (start,  end,  ceo). 


come  artificially  lower  ranked  than  the  unfused  rela¬ 
tionship  type  (ceo).  The  best  performing  contrained 
probabilistic  database  approach  beats  the  baseline 
by  5  points. 

5  Related  Work 

Techniques  for  information  extraction  from  min¬ 
imally  supervised  data  have  been  explored  by 
Brin  (1998),  Agichtein  and  Gravano  (2000),  and 
Ravichandran  and  Hovy  (2002).  Those  techniques 
propose  methods  for  estimating  extractors  from  ex¬ 
ample  relationships  and  a  corpus  which  contains  in¬ 
stances  of  those  relationships. 

Nahm  and  Mooney  (2002)  explore  techniques  for 
extracting  multiple  relationships  in  single  document 
extraction.  They  learn  rules  for  predicting  certain 
fields  given  other  extracted  fields  (i.e.  a  someone 
who  lists  Windows  as  a  specialty  is  likely  to  know 
Microsoft  Word). 

Perhaps  the  most  related  work  to  what  is  pre¬ 
sented  here  is  previous  research  which  uses  database 
information  as  co-occurrence  features  for  informa¬ 
tion  extraction  in  a  multi-document  setting.  Mann 


Figure  4:  Precision/Recall  curve  for  whole  database 
reconstruction.  Performance  curves  using  con¬ 
straints  dominate  the  baseline. 


and  Yarowsky  (2005)  present  an  incremental  ap¬ 
proach  where  co-occurrence  with  a  known  relation¬ 
ship  is  a  feature  added  in  training  and  test.  Cu- 
lotta  et  al.  (2006)  introduce  a  data  mining  approach 
where  discovered  relationships  from  a  database  arc 
used  as  features  in  extracting  new  relationships.  The 
database  constraints  presented  in  this  paper  provide 
a  more  general  framework  for  jointly  conditioning 
multiple  relationships.  Additionally,  this  constraint- 
based  approach  can  be  applied  without  special  train¬ 
ing  of  the  extraction/fusion  system. 

In  the  context  of  information  fusion  of  single  rela¬ 
tionships  across  multiple  documents,  Downey  et  al. 
(2005)  propose  a  method  that  models  the  probabili¬ 
ties  of  positive  and  negative  extracted  classifications. 
More  distantly  related,  Sutton  and  McCallum  (2004) 
and  Finkel  et  al.  (2005)  propose  graphical  models 
for  combining  information  about  a  given  entity  from 
multiple  mentions. 

In  the  field  of  question  answering,  Prager  et  al. 
(2004)  answer  a  question  about  the  list  of  composi¬ 
tions  produced  by  a  given  subject  by  looking  for  re¬ 
lated  information  about  the  subject’s  birth  and  death. 
Their  method  treats  supporting  information  as  fixed 
hai'd  constraints  on  the  original  questions  and  are  ap¬ 
plied  in  an  ad-hoc  fashion.  This  paper  proposes  a 
probabilistic  method  for  using  constraints  in  the  con¬ 
text  of  database  extraction  and  applies  this  method 
over  a  larger  set  of  relations. 

Richardson  and  Domingos  (2006)  propose  a 
method  for  reasoning  about  databases  and  logical 
constraints  using  Markov  Random  Fields.  Their 


model  applies  reasoning  stalling  from  a  known 
database.  In  this  paper  the  database  is  built  from  ex¬ 
traction/fusion  of  relationships  from  web  pages  and 
contains  a  significant  amount  of  noise. 

6  Conclusion 

This  paper  has  presented  a  probabilistic  method  for 
fusing  extracted  facts  in  the  context  of  database  ex¬ 
traction  when  there  exist  logical  constraints  between 
the  fields  in  the  database.  The  method  estimates  the 
probability  than  the  inclusion  of  a  given  relationship 
will  violate  database  constraints  by  taking  into  ac¬ 
count  the  uncertainty  of  the  other  extracted  relation¬ 
ships.  Along  with  the  relationships  explicitly  listed 
in  the  database,  constraints  arc  formed  over  implicit 
fields  directly  recoverable  from  the  explicit  listed  re¬ 
lationships. 

The  construction  of  CEO  succession  timelines  us¬ 
ing  minimally  trained  extractors  from  web  text  is  a 
particularly  challenging  problem  because  of  noise 
resulting  from  the  wide  variation  in  genre  in  the  cor¬ 
pora  and  errors  in  extraction.  The  use  of  constraints 
on  probabilistic  databases  is  effective  in  resolving 
many  of  these  errors,  leading  to  improved  precision 
and  recall  of  retrieved  facts,  with  F-measure  gains  of 
5  to  1 8  points. 

The  method  presented  in  this  paper  combines 
symbolic  and  statistical  approaches  to  natural  lan¬ 
guage  processing.  Logical  constraints  arc  made 
more  robust  by  taking  into  account  the  uncertainty 
of  the  extracted  information.  An  interesting  area 
of  future  work  is  the  application  of  data  mining  to 
search  for  appropriate  constraints  to  integrate  into 
this  model. 
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